Exercise 11
Set seed
Load packages
Question 1: Loading data
ExperimentHub with 21 records
# snapshotDate(): 2023-04-24
# $dataprovider: NA, Bodenmiller Group
# $species: Homo sapiens
# $rdataclass: flowSet, SummarizedExperiment, data.frame
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["EH2254"]]'
title
EH2254 | Bodenmiller_BCR_XL_SE
EH2255 | Bodenmiller_BCR_XL_flowSet
EH3047 | Weber_BCR_XL_sim_main_SE
EH3048 | Weber_BCR_XL_sim_main_flowSet
EH3049 | Weber_BCR_XL_sim_null_rep1_SE
... ...
EH3061 | Weber_BCR_XL_sim_less_distinct_less_50pc_SE
EH3062 | Weber_BCR_XL_sim_less_distinct_less_50pc_flowSet
EH3063 | Weber_BCR_XL_sim_less_distinct_less_75pc_SE
EH3064 | Weber_BCR_XL_sim_less_distinct_less_75pc_flowSet
EH3564 | Damond Diabetes Spatial Dataset
Bodenmiller_BCR_XL_flowSet is associated with query ID EH2255.
A flowSet with 16 experiments.
column names(39): Time Cell_length ... sample_id population_id
[1] "Time" "Cell_length" "CD3(110:114)Dd" "CD45(In115)Dd"
[5] "BC1(La139)Dd" "BC2(Pr141)Dd" "pNFkB(Nd142)Dd" "pp38(Nd144)Dd"
[9] "CD4(Nd145)Dd" "BC3(Nd146)Dd" "CD20(Sm147)Dd" "CD33(Nd148)Dd"
[13] "pStat5(Nd150)Dd" "CD123(Eu151)Dd" "pAkt(Sm152)Dd" "pStat1(Eu153)Dd"
[17] "pSHP2(Sm154)Dd" "pZap70(Gd156)Dd" "pStat3(Gd158)Dd" "BC4(Tb159)Dd"
[21] "CD14(Gd160)Dd" "pSlp76(Dy164)Dd" "BC5(Ho165)Dd" "pBtk(Er166)Dd"
[25] "pPlcg2(Er167)Dd" "pErk(Er168)Dd" "BC6(Tm169)Dd" "pLat(Er170)Dd"
[29] "IgM(Yb171)Dd" "pS6(Yb172)Dd" "HLA-DR(Yb174)Dd" "BC7(Lu175)Dd"
[33] "CD7(Yb176)Dd" "DNA-1(Ir191)Dd" "DNA-2(Ir193)Dd" "group_id"
[37] "patient_id" "sample_id" "population_id"
Out of the 39 columns, we have a total of 33 recorded markers (with suffix Dd), out of which 9 are neither type nor state markers (as seen later); these are not considered any further.
The flowSet contains 16 experiments, i.e., there are a total of 16 samples.
num_cells <- rep(0, 16)
names(num_cells) <- rownames(phenoData(data_fs))
for (i in 1:16) {
num_cells[i] <- dim(exprs(data_fs[[i]]))[1]
}
num_cells PBMC8_30min_patient1_BCR-XL.fcs PBMC8_30min_patient1_Reference.fcs
2838 2739
PBMC8_30min_patient2_BCR-XL.fcs PBMC8_30min_patient2_Reference.fcs
16675 16725
PBMC8_30min_patient3_BCR-XL.fcs PBMC8_30min_patient3_Reference.fcs
12252 9434
PBMC8_30min_patient4_BCR-XL.fcs PBMC8_30min_patient4_Reference.fcs
8990 6906
PBMC8_30min_patient5_BCR-XL.fcs PBMC8_30min_patient5_Reference.fcs
8543 11962
PBMC8_30min_patient6_BCR-XL.fcs PBMC8_30min_patient6_Reference.fcs
8622 11038
PBMC8_30min_patient7_BCR-XL.fcs PBMC8_30min_patient7_Reference.fcs
14770 15974
PBMC8_30min_patient8_BCR-XL.fcs PBMC8_30min_patient8_Reference.fcs
11653 13670
The first pair of samples (from patient 1) have only around 2800 cells each. Except this pair, all other samples have around 7000 to 17000 cells.
Question 2: Constructing a SingleCellExperiment
class: SingleCellExperiment
dim: 24 172791
metadata(2): experiment_info chs_by_fcs
assays(2): counts exprs
rownames(24): CD3 CD45 ... HLA-DR CD7
rowData names(3): channel_name marker_name marker_class
colnames: NULL
colData names(3): sample_id condition patient_id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
[1] 24 172791
There are 24 markers recorded. The total number of cells over all samples is 172791.
Question 3: Type and state markers
[1] "CD3" "CD45" "CD4" "CD20" "CD33" "CD123" "CD14" "IgM"
[9] "HLA-DR" "CD7"
[1] "pNFkB" "pp38" "pStat5" "pAkt" "pStat1" "pSHP2" "pZap70" "pStat3"
[9] "pSlp76" "pBtk" "pPlcg2" "pErk" "pLat" "pS6"
Some markers that seem to violate the assumption that expressions are fairly consistent across conditions are CD7 (higher in stimulated), CD4 (higher in reference), HLA-DR (higher in stimulated), and CD20 (higher in reference). Additionally, CD3 and CD14 also show slightly variable expression across samples but not systematically between conditions. The only markers which are consistently expressed similarly across samples and conditions are CD45, CD123, CD33, and IgM.
Question 4: Clustering
# cell counts per meta8 cluster per sample
dcast(data.frame(sample_id=sample_ids(sce), cluster_id=cluster_ids(sce, "meta8")), sample_id~cluster_id) sample_id 1 2 3 4 5 6 7 8
1 BCRXL1 763 22 122 612 163 965 141 50
2 BCRXL2 5948 473 860 2354 1367 4933 392 348
3 BCRXL3 3921 645 1558 2218 1108 1747 739 316
4 BCRXL4 2811 595 1165 1663 896 1235 434 191
5 BCRXL5 3162 332 978 1233 636 1575 453 174
6 BCRXL6 3807 499 1096 743 586 1226 419 246
7 BCRXL7 3925 438 678 3825 1931 3014 677 282
8 BCRXL8 3640 444 790 1557 1795 2644 478 305
9 Ref1 1215 15 224 422 157 461 175 70
10 Ref2 7864 291 1834 1619 1674 2431 651 361
11 Ref3 3215 268 1547 1406 1066 712 1029 191
12 Ref4 2034 377 1728 1063 605 616 387 96
13 Ref5 4530 340 1641 1313 2055 1079 733 271
14 Ref6 5146 303 887 733 1892 921 787 369
15 Ref7 4310 333 818 3384 3720 2080 1011 318
16 Ref8 4986 349 1118 1485 2490 2185 720 337
# cell percentages per 8-metacluster per condition
cells_per_cluster <- sweep(dcast(data.frame(condition=colData(sce)$condition, cluster_id=cluster_ids(sce, "meta8")), condition~cluster_id)[,2:9], 1, 0.01*c(sum(n_cells(sce)[1:8]), sum(n_cells(sce)[9:16])), "/")
rownames(cells_per_cluster) <- unique(colData(sce)$condition)
colnames(cells_per_cluster) <- paste0("cluster", 1:8)
cells_per_cluster cluster1 cluster2 cluster3 cluster4 cluster5 cluster6 cluster7 cluster8
BCRXL 33.17051 4.088069 8.592296 16.84194 10.05655 20.55772 4.425975 2.266934
Ref 37.64924 2.573263 11.076565 12.91719 15.44297 11.85442 6.210429 2.275914
Clusters 2, 4, and 6 are more frequent in the stimulated samples as compared to the reference samples, while clusters 1, 3, 5 and 7 are less frequent. Cluster 8 is almost equally frequent in both conditions.
Question 5: Dimensionality reduction
We see that cells are more or less spread homogeneously across all patients, so batch effects across patients can be neglected.
We see a clear rightward shift of the stimulated samples as compared to the reference samples in the right half of the UMAP plot, with a slight upward shift also seen in the left half of the plot.
Question 6: Exploratory data analysis
All 8 pairs of samples (each pair collected from the same patient) are well-separated in the first dimension. Additionally, the pair of samples from patient 6 is also well separated in the second dimension.
All BCRXL-stimulated samples are on the left half of the MDS plot, well separated in the first dimension from the reference samples which are all on the right half of the plot. However, the BCRXL6 sample is somewhat close to the reference samples in the first dimension. Additionally, the reference samples are more spread out within themselves as compared to the BCRXL-stimulated samples.
Question 7: Differential state (DS) analysis
ei <- ei(sce)
# linear mixed model with a random intercept for each patient
ds_formula1 <- createFormula(ei, cols_fixed="condition", cols_random="patient_id")
# regular linear model
ds_formula2 <- createFormula(ei, cols_fixed="condition")
# create contrasts between conditions
contrast <- createContrast(c(0,1))
# run diffcyt for the two models at the 8-metacluster level
res_ds1 <- diffcyt(sce, ei, formula=ds_formula1, contrast=contrast, analysis_type="DS", method_DS="diffcyt-DS-LMM", clustering_to_use="meta8")
res_ds2 <- diffcyt(sce, ei, formula=ds_formula2, contrast=contrast, analysis_type="DS", method_DS="diffcyt-DS-LMM", clustering_to_use="meta8")
# test results for the two models
(tbl_ds1 <- rowData(res_ds1$res))DataFrame with 112 rows and 4 columns
cluster_id marker_id p_val p_adj
<factor> <factor> <numeric> <numeric>
1 1 pNFkB 2.44332e-07 8.55162e-07
2 2 pNFkB 3.10957e-08 1.28990e-07
3 3 pNFkB 2.68674e-14 2.31473e-13
4 4 pNFkB 4.77849e-11 2.81679e-10
5 5 pNFkB 3.28304e-09 1.53209e-08
... ... ... ... ...
4 4 pS6 3.16311e-05 9.08379e-05
5 5 pS6 1.58638e-03 3.52018e-03
6 6 pS6 2.87609e-07 9.76128e-07
7 7 pS6 0.00000e+00 0.00000e+00
8 8 pS6 1.14738e-03 2.79361e-03
DataFrame with 112 rows and 4 columns
cluster_id marker_id p_val p_adj
<factor> <factor> <numeric> <numeric>
1 1 pNFkB 1.44285e-04 0.000850520
2 2 pNFkB 9.50763e-04 0.003600878
3 3 pNFkB 2.05407e-04 0.001023640
4 4 pNFkB 1.23368e-05 0.000151000
5 5 pNFkB 3.75724e-05 0.000350676
... ... ... ... ...
4 4 pS6 1.20918e-03 4.23214e-03
5 5 pS6 1.07569e-02 3.34660e-02
6 6 pS6 5.75568e-04 2.47937e-03
7 7 pS6 3.68328e-12 4.12527e-10
8 8 pS6 1.00081e-02 3.20259e-02
There are a total of 8*14 = 112 tests, since there are 8 metaclusters and 14 state markers. Each test is to identify whether the given state marker is expressed differentially across conditions in the given metacluster of cells.
threshold <- 0.05
# number of DS findings for the two models at FDR <= 0.05
table(tbl_ds1$p_adj<=threshold)
FALSE TRUE
28 84
FALSE TRUE
70 42
When we account for patient variability, we identify many more DS markers (84 as compared to 42 without), thus, the sensitivity is doubled here.
Question 8: Visualizing DS results
We observe the expression of the top 50 DS markers (ordered by FDR) across samples. We indeed see that the markers are expressed more in a certain cluster for the stimulated samples as compared to the reference samples (like pPlcg2 in cluster 7) and vice versa (like pBtk in cluster 1).
We now visualize a UMAP representation of the clusters at the 8-metacluster level to identify where each cluster lies, and then plot UMAPs of the specific clusters where a given marker is differentially expressed (colored by scaled expression and split by condition) for the top 10 hits.
top_hits <- topTable(res_ds1, top_n=50)
plots <- vector("list", 10)
for (i in 1:10) {
print(plotDR(sce[,cluster_ids(sce, "meta8") == top_hits$cluster_id[i]], color_by=as.character(top_hits$marker_id[i]), facet_by="condition") +
labs(title=paste0(top_hits$marker_id[i], "(", top_hits$cluster_id[i], ")"), x="UMAP1", y="UMAP2") +
theme(axis.text =element_blank()))
}We see that the differential expression of the respective DS markers is quite clear in the displayed clusters. For example, the pBtk marker is consistently expressed lower in the stimulated samples as compared to the reference samples in clusters 1, 3, 4, 6, 7, and 8, while the pPlcg2 and pS6 markers are expressed higher in the stimulated samples as compared to the reference samples in clusters 7 and 3, respectively.
Runtime in seconds: